Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gcp/saver: Only return errors.KindAlreadyExists if all three exist #1957

Merged
merged 6 commits into from
Jun 2, 2024

Conversation

dwbuiten
Copy link
Contributor

@dwbuiten dwbuiten commented Apr 30, 2024

What is the problem I am trying to address?

In #1124, a GCP lock type was added as a singleflight backend. As part of this work, the GCP storage backend's Save() was made serial, likely because moduploader.Upload requires a call to Exists() before it, rendering the GCP lock less useful, by doubling the calls to GCS.

However, by doing this, the existence check was now only checking the existence of the mod file, and not the info or zip. This meant that if during a Save(), the zip or info uploads failed, on subsequent rquests, that when using the GCP singleflight backend, Athens woul assume everything had been stashed and saved properly, and then fail to serve up the info or zip that had failed upload, meaning the cache was in an unhealable broklen state, requiring a manual intervention.

How is the fix applied?

To fix this, without breaking the singleflight behavior, introduce a metadata key that is set on the mod file during its initial upload, indicating that a Stash is still in progress on subsequent files, which gets removed once all three files are uploaded successfully, which can be checked if it it is determined that the mod file already exists. That way we can return a errors.KindAlreadyExists if a Stash is in progress, but also properly return it when a Stash is not currently in progress if and only if all three files exist on GCS, which prevents the cache from becoming permanently poisoned.

One note is that it is possible the GCS call to remove the metadata key fails, which would mean it is left on the mod object forever. To avoid this, consider it stale after 2 minutes.

What GitHub issue(s) does this PR fix or close?

N/A

Extra Notes

This is my first time touching the Athens codebase, so apologies for any faux pas. We hit this at $dayjob, so thought I would take a stab at fixing it.

It is a more complex than I would like, so comments very welcome.

This also contains a small fix I found when writing the test: if io.Copy failed during an upload to GCS, it would call Close() on the writer, which would leave a partial/broken file. The storage docs say to cancel the context to prevent this.

@dwbuiten dwbuiten requested a review from a team as a code owner April 30, 2024 18:04
@matt0x6F
Copy link
Contributor

matt0x6F commented May 1, 2024

@dwbuiten thanks for your contribution to Athens! I'm doing a release today, but I'll get to reviewing this and we'll see if we can get it into one of the fix releases.

Just off the bat, one thing that'd help show what this is fixing is if you're able to write tests to it in E2E or unit tests.

@dwbuiten
Copy link
Contributor Author

dwbuiten commented May 2, 2024

@matt0x6F Sure - adding a unit test to with_gcs_test.go. And it is a good thing too... I noticed the test there does not run by default (without GCP creds in the env), and when I ran the existing test, I realized this fix actually breaks singleflight, while fixing the poison cache problem. I need to ponder a bit on the subject, but will update this PR with a different fix + test after.

@dwbuiten dwbuiten force-pushed the gcp-cache-poison branch 2 times, most recently from b637dc4 to 2da5d26 Compare May 2, 2024 20:08
@dwbuiten
Copy link
Contributor Author

dwbuiten commented May 2, 2024

@matt0x6F It ended up being more complex than I thought (or would like it to be), but old and new tests all pass now. I updated the PR description to reflect that.

CCing @dfinkel who helped me think this through.

@dwbuiten dwbuiten force-pushed the gcp-cache-poison branch from 2da5d26 to 13a401c Compare May 2, 2024 22:55
dwbuiten added 3 commits May 2, 2024 23:57
In gomods#1124, a GCP lock type was added as a singleflight backend. As
part of this work, the GCP backend's Save() was made serial, likely
because moduploader.Upload requires a call to Exists() before it,
rendering the GCP lock less useful, by doubling the calls to GCS.

However, by doing this, the existence check was now only checking
the existence of the mod file, and not the info or zip. This meant
that if during a Save, the zip or info uploads failed, on subsequent
rquests, that when using the GCP singleflight backend, Athens would
assume everything had been stashed and saved properly, and then
fail to serve up the info or zip that had failed upload, meaning
the cache was in an unhealable broklen state, requiring a manual
intervention.

To fix this, without breaking the singleflight behavior, introduce
a metadata key that is set on the mod file during its initial upload,
indicating that a Stash is still in progress on subsequent files,
which gets removed once all three files are uploaded successfully,
which can be checked if it it is determined that the mod file
already exists. That way we can return a errors.KindAlreadyExists
if a Stash is in progress, but also properly return it when a Stash
is *not* currently in progress if and only if all three files exist
on GCS, which prevents the cache from becoming permanently poisoned.

One note is that it is possible the GCS call to remove the metadata
key fails, which would mean it is left on the mod object forever.
To avoid this, consider it stale after 2 minutes.

Signed-off-by: Derek Buitenhuis <[email protected]>
This is the proper way to make sure we do not write a partial file.

Signed-off-by: Derek Buitenhuis <[email protected]>
@dwbuiten dwbuiten force-pushed the gcp-cache-poison branch from 13a401c to aa92813 Compare May 2, 2024 22:57
Copy link
Contributor

@matt0x6F matt0x6F left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I have an outstanding question regarding user configurability, but overall I think the PR looks great. I'm also glad we wrote some tests out. It makes me wonder if the GCP Storage ever reliably worked without these changes.

pkg/storage/gcp/saver.go Outdated Show resolved Hide resolved
@dwbuiten dwbuiten force-pushed the gcp-cache-poison branch 2 times, most recently from 2b75213 to 839eda3 Compare May 8, 2024 17:15
@dwbuiten dwbuiten requested a review from matt0x6F May 8, 2024 17:18
This is useful if, for example, the user is on a slow network.

Signed-off-by: Derek Buitenhuis <[email protected]>
@dwbuiten dwbuiten force-pushed the gcp-cache-poison branch from 839eda3 to 1814b41 Compare May 8, 2024 17:52
Copy link
Contributor

@matt0x6F matt0x6F left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks for adding the configuration. We'll get this out in our next feature release.

@matt0x6F matt0x6F enabled auto-merge (squash) June 2, 2024 19:23
@matt0x6F matt0x6F merged commit 0ef761c into gomods:main Jun 2, 2024
11 checks passed
@matt0x6F matt0x6F added this to the 0.14.1 milestone Jun 2, 2024
matt0x6F referenced this pull request in gomods/athens-charts Jun 3, 2024
[![Mend
Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [gomods/athens](https://togithub.com/gomods/athens) | patch |
`v0.14.0` -> `v0.14.1` |

---

### Release Notes

<details>
<summary>gomods/athens (gomods/athens)</summary>

### [`v0.14.1`](https://togithub.com/gomods/athens/releases/tag/v0.14.1)

[Compare
Source](https://togithub.com/gomods/athens/compare/v0.14.0...v0.14.1)

#### What's Changed

- CI dependency updates in
[https://github.com/gomods/athens/pull/1962](https://togithub.com/gomods/athens/pull/1962)
- Fix for AWS default credentials by
[@&#8203;yongzhang](https://togithub.com/yongzhang) in
[https://github.com/gomods/athens/pull/1963](https://togithub.com/gomods/athens/pull/1963)
- Fix for E2E test runs by
[@&#8203;joeymhills](https://togithub.com/joeymhills) in
[https://github.com/gomods/athens/pull/1966](https://togithub.com/gomods/athens/pull/1966)
- Feature gcp/saver: Only return errors.KindAlreadyExists if all three
exist by [@&#8203;dwbuiten](https://togithub.com/dwbuiten) in
[https://github.com/gomods/athens/pull/1957](https://togithub.com/gomods/athens/pull/1957)
- Fix to set correct content type and send only once by
[@&#8203;matt0x6F](https://togithub.com/matt0x6F) in
[https://github.com/gomods/athens/pull/1965](https://togithub.com/gomods/athens/pull/1965)

#### New Contributors

- [@&#8203;yongzhang](https://togithub.com/yongzhang) made their first
contribution in
[https://github.com/gomods/athens/pull/1963](https://togithub.com/gomods/athens/pull/1963)
- [@&#8203;joeymhills](https://togithub.com/joeymhills) made their first
contribution in
[https://github.com/gomods/athens/pull/1966](https://togithub.com/gomods/athens/pull/1966)
- [@&#8203;dwbuiten](https://togithub.com/dwbuiten) made their first
contribution in
[https://github.com/gomods/athens/pull/1957](https://togithub.com/gomods/athens/pull/1957)

**Full Changelog**:
gomods/athens@v0.14.0...v0.14.1

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR has been generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View
repository job log
[here](https://developer.mend.io/github/gomods/athens-charts).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4zNzcuOCIsInVwZGF0ZWRJblZlciI6IjM3LjM3Ny44IiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6W119-->

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants